ABSTRACT

Folksonomies provide a rich source of data to study social 
patterns taking place on the World Wide Web. Here we 
study the temporal patterns of users' tagging activity. We 
show that the statistical properties of inter-arrival times be- 
tween subsequent tagging events cannot be explained with- 
out taking into account correlation in users' behaviors. This 
shows that social interaction in collaborative tagging communities
shapes the evolution of folksonomies. A consensus formation process
involving the usage of a small number of tags for a given resource
is observed through a numerical and analytical study of some
well-known folksonomy datasets.

Categories and Subject Descriptors 

H.3.4 [Information Systems]: Systems and Software; H.3.1
[Information Storage and Retrieval]: Content Analysis and
Indexing; G.2.2 [Mathematics of Computing]: Graph Theory

General Terms 

Measurement, Theory 
1. INTRODUCTION

The science of online social networks has recently become an
interdisciplinary research field, since the technological environment
and the number of interacting agents require the contribution of
researchers such as computer scientists, physicists and sociologists.
A particular example of such social systems are folksonomies
[12] [7] [17] [16], i.e. online communities of users who, interacting
through the World Wide Web, collaboratively build large and public
knowledge bases of discrete resources such as bookmarks, scientific
papers and digital images. Moreover, folksonomy users also participate
in the classification of individual resources, by labeling each of
them with arbitrarily chosen tags, that is, a (typically small) number
of keywords describing each resource.

Folksonomies act both as public sources of information and as storage
systems for single users, who selfishly collect resources for their
own private use. These two tasks may push the evolution of these
systems in opposite directions [13]. As regards the first purpose, the
development of cooperative behavior among users is crucial. Users have
to agree on tag semantics, so that the tag-based classification of
resources is coherent and readable. On the other hand, the popularity
of such communities depends on the small effort demanded of users when
adding elementary information units, whose description by tags, though
simple, is very approximate [18]. Besides, the cultural background,
the effort and the needs of users vary a lot throughout the community.
This often generates ambiguous, incomplete or incoherent descriptions
of the collected information and affects its overall accessibility.

A fundamental mechanism of consensus building among users is
imitation. For example, consensus triggers the adoption of a given tag
by many users when describing a resource or a whole set of resources,
for descriptive or even strategic purposes. So, the social patterns of
users' interaction are reflected in the statistical distribution of
tag usage. A highly skewed distribution in the usage of tags has
already been observed, showing that their occurrences vary over many
orders of magnitude [4]. This recalls Zipf's law observed in written
texts [19], where the occurrence of words is distributed according to
a power law. So, the skewness of the tag frequency distribution may be
generated by endogenous mechanisms, or alternatively be the result of
the statistical properties of the underlying language.

Less attention has been devoted so far to the statistics of tag
dynamics. The growth of the vocabulary, i.e. the number of distinct
tags as a function of time, has been empirically found to be
sub-linear in different social tagging systems, and appropriate models
have been developed to reproduce such growth rate, along with the
frequency distribution of tags [2] [9].

Frequent and rare tags, of course, occur with different inter-arrival
times, but a clear picture of the correlations in the usage of the
same tags by different users has not been drawn so far. In the
following, we study the statistics of inter-arrival times in some
well-known collaborative tagging systems, where the large number of
users allows a reliable empirical analysis, and we look for evidence
for or against the presence of correlation and collaboration patterns
by detecting regularities in the temporal statistics of tag arrivals.

Similar analyses have already been performed for other data sets,
namely texts, showing that the distribution of word occurrences is not
random and that deviations from a Poissonian picture are present. Fat
tails in the distribution of word inter-arrival times have been
detected in texts and related to the underlying semantics
[6] [11] [14] [5] [1]. We focus here on a different kind of word
sequence, that is, the sequence of tags used by annotating users in
some web-based social tagging community to describe the corresponding
resources.

2. DATASETS 

The datasets studied here describe the tagging activities in some
well-known collaborative bookmarking websites, including del.icio.us,
Bibsonomy and CiteULike. The data report individual tag assignments
posted by users in chronological order. Each tag assignment is a
triplet formed by a user, a resource and a tag. Resources are URLs in
del.icio.us, while Bibsonomy and CiteULike collect scientific
citations. Tags are keywords associated by users to describe
resources. Each user can assign an arbitrary number of tags to the
same resource in a single post, so more than one tag assignment may
occur at the same time.
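The data layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `assign_post_times` and the example triplets are hypothetical, and the rule used to group triplets into posts (consecutive triplets sharing user and resource form one post) is an assumption, since the datasets' actual post boundaries are not specified here.

```python
from itertools import groupby

def assign_post_times(assignments):
    """Attach a discrete time (post index) to each (user, resource, tag) triplet.

    Assumption: consecutive triplets sharing the same user and resource
    belong to the same post and therefore share the same time.
    """
    timed = []
    posts = groupby(assignments, key=lambda a: (a[0], a[1]))
    for post_index, (_key, group) in enumerate(posts):
        for user, resource, tag in group:
            timed.append((post_index, user, resource, tag))
    return timed

# Hypothetical chronological stream of tag assignments.
assignments = [
    ("u1", "http://example.org", "web"),
    ("u1", "http://example.org", "reference"),
    ("u2", "http://example.org", "web"),
    ("u2", "http://example.com", "news"),
]
timed = assign_post_times(assignments)
for row in timed:
    print(row)
```

The two tags posted together by `u1` receive the same time, which is how several tag assignments can occur simultaneously.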

The del.icio.us dataset comprises f40306315 tag assignments, with
2482873 tags and 18778597 resources. The CiteULike dataset collects
571340 tag assignments, with 199512 resources and 51080 tags. The
Bibsonomy dataset includes 671808 tag assignments, with 206942
resources and 58756 tags.

3. TAG DYNAMICS: OBSERVATIONS 

Correlations in the behavior of users collaborating in tagging
resources online can be studied by inspecting the temporal statistics
of tag usage. Time, here, is discrete and is measured in number of
successive posts. For example, one can study the inter-arrival time of
tags, that is, the time elapsed between two subsequent tag assignments
involving the same tag. If users behave independently, tags are added
with a constant probability at each time unit. Accordingly, the
arrival of a tag would be described by a Poissonian process, where
each occurrence is uncorrelated with the previous one. In this case,
inter-arrival times are distributed according to an exponential
distribution with a well-defined average inter-arrival time given by
$1/f$, where $f$ is the tag frequency [15].
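The independent-user baseline can be illustrated with a small simulation. The sketch below assumes a hypothetical tag that is added with constant probability `p` at each discrete time step (the value of `p`, the function names and the stream length are illustrative choices, not taken from the datasets). In discrete time the gaps are geometrically distributed, which is the discrete counterpart of the exponential law, with mean close to 1/p.

```python
import random

def simulate_tag_arrivals(p, n_steps, seed=0):
    """Time steps at which a tag occurs, assuming it is added
    independently with constant probability p at each step."""
    rng = random.Random(seed)
    return [t for t in range(n_steps) if rng.random() < p]

def inter_arrival_times(arrival_times):
    """Differences between successive occurrence times of the same tag."""
    return [b - a for a, b in zip(arrival_times, arrival_times[1:])]

p = 0.01  # assumed tag frequency per time step (illustrative value)
arrivals = simulate_tag_arrivals(p, n_steps=200_000)
gaps = inter_arrival_times(arrivals)
mean_gap = sum(gaps) / len(gaps)

# For independent arrivals, P(gap > t) = (1 - p)**t, which decays
# exponentially in t; gaps longer than 3/p should therefore be rare.
tail_fraction = sum(g > 3 / p for g in gaps) / len(gaps)
print(mean_gap, tail_fraction)
```

Under this baseline only about five percent of the gaps exceed three times the mean, in contrast with a fat-tailed distribution where very long gaps remain likely.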

By contrast, the observed distribution of tag inter-arrival times
spans all time scales, with a fat tail, as shown in figure 1. The
number of inter-arrival times of time length $t$, computed over all
tags, follows a power law $W(t) \propto t^{-\gamma}$, with
$\gamma \simeq 1.3$ in different tagging systems.

Figure 1: Tag inter-arrival times distribution $W(t)$ in
collaborative tagging communities as a function of $t$.

Figure 2: Stationary tag occurrence distribution in collaborative
tagging communities.

The latter analysis has been performed on a subset of "stationary"
tags, that is, tags that occur throughout the whole dataset. This aims
to exclude tags that start or stop occurring during the time window
covered by the dataset. These could be frequently occurring tags with
a short typical inter-arrival time, even though their observed
frequency may be small because of the partial overlap between the
dataset time window and their lifespan. Thus, a tag with frequency $f$
is called "stationary" if its first occurrence time and the time
interval between its last occurrence and the end of the dataset time
window are both less than $1/f$.
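The stationarity criterion can be written as a short predicate. This is a minimal sketch, assuming that the frequency $f$ is the number of occurrences divided by the length of the dataset window, so that $1/f$ is the typical inter-arrival time; the function name `is_stationary` and the example occurrence times are hypothetical.

```python
def is_stationary(occurrence_times, window_length):
    """Check the stationarity criterion for one tag.

    occurrence_times: sorted times (in posts) at which the tag occurs,
                      measured from the start of the dataset window.
    window_length:    total length T of the dataset window, in posts.
    """
    if not occurrence_times:
        return False
    # Assumed definition: occurrences per unit of time.
    frequency = len(occurrence_times) / window_length
    typical_gap = 1.0 / frequency  # expected inter-arrival time 1/f
    time_before_first = occurrence_times[0]
    time_after_last = window_length - occurrence_times[-1]
    return time_before_first < typical_gap and time_after_last < typical_gap

T = 1000
spread_out = [5, 120, 260, 400, 610, 790, 950]   # occurs over the whole window
late_starter = [700, 720, 750, 760, 800, 900, 990]  # first appears late
print(is_stationary(spread_out, T), is_stationary(late_starter, T))
```

The second tag is rejected because it is silent for much longer than its typical gap at the start of the window, which suggests that its lifespan only partially overlaps the dataset.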

Nevertheless, this power-law behavior may be the consequence of the
uneven distribution of tag occurrences, which is known to follow a
Zipf law. As reported in figure 2, the number of tags occurring $f$
times is a power law $P(f) \propto f^{-\beta}$ in several
collaborative tagging communities, with $\beta \simeq 1.7$.

The fat tails in figure 1 may be determined by the large number of
tags with low frequency (i.e., long inter-arrival times). To verify
this, one reshuffles the time ordering of tags by reassigning them to
randomly chosen posts. This way, time correlations are removed and the
distribution of inter-arrival times is determined solely by the Zipf
law in the frequency of tags. From now on, we limit our statistical
analysis to the larger del.icio.us dataset, where richer data allow a
more reliable statistical analysis. However, the numerical and
analytical results presented here hold within a reasonable
approximation in other social bookmarking communities. As checked in
figure 3, after reshuffling the inter-arrival time distribution
changes only slightly from the power-law behavior described above.
Therefore, the distribution $W(t)$ is not a signature of complex
correlation patterns. This can also be easily understood by a simple
analytical argument, shown in the next section.

Figure 3: Comparison between the tag inter-arrival times distribution
in the original del.icio.us dataset (stationary tags) and in an
artificial one where the time ordering of tag assignments has been
randomly reshuffled.
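The reshuffling test can be sketched as follows. The code is an illustration under simplifying assumptions: the stream holds one tag per time step (whereas real posts may carry several tags), the toy stream and the function names are hypothetical, and the shuffle is a plain random permutation of the tag sequence.

```python
import random
from collections import Counter

def all_inter_arrival_times(tag_stream):
    """Inter-arrival times pooled over all tags; time is the stream position."""
    last_seen = {}
    gaps = []
    for t, tag in enumerate(tag_stream):
        if tag in last_seen:
            gaps.append(t - last_seen[tag])
        last_seen[tag] = t
    return gaps

def reshuffled(tag_stream, seed=0):
    """Copy of the stream with its time ordering destroyed.
    Tag frequencies are preserved exactly."""
    shuffled = list(tag_stream)
    random.Random(seed).shuffle(shuffled)
    return shuffled

# Toy stream: tag "a" arrives in two bursts, tags "b" and "c" alternate.
stream = ["a"] * 50 + ["b", "c"] * 100 + ["a"] * 50
shuffled_stream = reshuffled(stream)

original_gaps = Counter(all_inter_arrival_times(stream))
shuffled_gaps = Counter(all_inter_arrival_times(shuffled_stream))
print(sorted(original_gaps.items())[:5])
print(sorted(shuffled_gaps.items())[:5])
```

Because the shuffle keeps every tag's number of occurrences, any difference between the two histograms can only come from temporal correlations, which is the comparison made in figure 3.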

Thus, one should observe the inter-arrival time distributions of
individual tags, which of course have poorer statistics. Here one
finds different patterns for high-frequency and low-frequency tags.
The former display a fast decay in the distribution for large values
of the inter-arrival time. Reasonably, tags that occur less frequently
display longer inter-arrival times with a finite probability. Their
inter-arrival time distributions decay as a power law for large values
of $\Delta t$.

The presence of power laws in the distribution of inter-arrival times
is often closely related to processes taking place in "avalanches",
i.e. with long periods of stability interrupted by sudden bursts of
activity at all scales of magnitude, limited only by finite-size
effects, as shown in the example reported in figure 5. Scale
invariance in the inter-arrival time distribution corresponds to
unpredictability of future events, given the past time series [10].

By reshuffling tags, time correlations would be removed, and the
curves corresponding to those plotted in figure 3 would exhibit an
exponential decay.

4. INTER-ARRIVAL TIMES DISTRIBUTION 

Figure 4: The inter-arrival times distribution for individual tags
"drupal" (33442 occurrences), "presse" (5011) and "chm" (999) in the
collaborative tagging system del.icio.us.

Figure 5: The weekly usage of the tag "cue" during the two years
covered by the del.icio.us dataset, displaying periods of high
activity and sharp activity peaks. The x-axis reports the number of
weeks since 1st January 1970.

Figure 6: The cutoff function $g(ft) = W_f(t) f^{\alpha-2} t^{\alpha}$
plotted against $ft$ for the inter-arrival times distribution of
individual tags "drupal" (33442 occurrences), "presse" (5011) and
"chm" (999) in the collaborative tagging system del.icio.us.

The statistics of inter-arrival times for high-frequency and
low-frequency tags can be simply related. The inter-arrival time
distributions of individual tags can be modeled by a scaling function
depending on the tag frequency $f$, written as

$$W_f(t) \propto R(f)\, t^{-\alpha}\, g(f^{\psi} t) \qquad (1)$$

where $W_f(t)$ is the number of inter-arrival times of time length $t$
for a tag of frequency $f + 1$, and $g$ is a function which is
constant for low values of its argument and decays rapidly after a
given cut-off value, i.e. $g(x) \simeq 1$ for $x \ll 1$ and
$g(x) \simeq 0$ for $x \gg 1$. Since the total number of inter-arrival
times for a tag with frequency $f + 1$ is $f$, $W_f(t)$ is normalized
by

$$\int_0^{\infty} W_f(t)\, dt = f. \qquad (2)$$

By using the definition (1), this leads to

$$f \propto R(f)\, f^{\psi(\alpha - 1)} \int_0^{\infty} x^{-\alpha} g(x)\, dx, \qquad (3)$$

where $x = f^{\psi} t$, so that $R(f) = f^{1 + \psi(1 - \alpha)}$.

The inter-arrival times distribution observed over all tags,
$W(t) \propto t^{-\gamma}$, which receives contributions from the
occurrence of tags of all frequencies, distributed according to the
power law $P(f) \propto f^{-\beta}$, can be written as

$$W(t) = \int_0^{\infty} P(f)\, W_f(t)\, df, \qquad (4)$$

which, after replacing $P$ and $W_f$ by their functional forms, reads

$$W(t) \propto t^{\frac{\beta - 2}{\psi} - 1} \int_0^{\infty} x^{(1 - \alpha) - 1 + \frac{2 - \beta}{\psi}} g(x)\, dx. \qquad (5)$$

Thus, one obtains the relation $\psi = \frac{2 - \beta}{\gamma - 1}$.
By replacing the observed values for $\beta$ and $\gamma$, the
relation yields $\psi \simeq 1$ for del.icio.us,
$W(t) \propto t^{\beta - 3}$ and

$$W_f(t) \propto f^{2 - \alpha}\, t^{-\alpha}\, g(tf). \qquad (6)$$

The value of the exponent $\alpha \simeq 0.75$, measured from the
inter-arrival times distribution in del.icio.us, is verified in
figure 6, where the inter-arrival times distributions for tags with
different frequencies collapse onto the same function.
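The exponent relation and the rescaling behind the collapse can be sketched numerically. The snippet below is an illustration only: it assumes $\psi = 1$, takes $f$ as the number of observed gaps of a tag, and uses a small made-up list of gaps; the function names `scaling_exponent` and `rescaled_distribution` are hypothetical.

```python
from collections import Counter

def scaling_exponent(beta, gamma):
    """psi = (2 - beta) / (gamma - 1), obtained by matching W(t) ~ t**(-gamma)."""
    return (2.0 - beta) / (gamma - 1.0)

def rescaled_distribution(gaps, alpha):
    """Points (f*t, W_f(t) * f**(alpha - 2) * t**alpha), assuming psi = 1.

    gaps: inter-arrival times of one tag; f is taken as the number of gaps.
    If the scaling form holds, points from tags of different frequency
    should fall on the same curve g(f*t).
    """
    f = len(gaps)
    counts = Counter(gaps)  # W_f(t): number of gaps of length t
    return sorted(
        (f * t, w * f ** (alpha - 2.0) * t ** alpha) for t, w in counts.items()
    )

psi = scaling_exponent(beta=1.7, gamma=1.3)
points = rescaled_distribution([1, 1, 2, 4, 4, 4, 9], alpha=0.75)
print(psi, points)
```

With real data one would plot the rescaled points of several tags on logarithmic axes, typically after binning the gaps, and check visually whether they overlap.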

After reshuffling the tag order as described above, since the arrival
of tags is now a Poissonian process, the distribution (6) changes into
an exponential function, with inter-arrival times statistics equal to
$W_f^{(P)}(t) \propto f \lambda(f) e^{-\lambda(f) t}$. Here
$\lambda(f)$ is the average arrival rate, i.e. the inverse of the
average inter-arrival time, equal to $f/T$, where $T$ is the time
length of the observed period. The distribution $W(t)$ can thus be
computed as

$$W(t) = \int_0^{\infty} P(f)\, f \lambda(f)\, e^{-\lambda(f) t}\, df, \qquad (7)$$

which, by replacing the power-law form of $P(f)$, yields
$W(t) \propto t^{\beta - 3}$, showing that the reshuffling changes
only the single-tag inter-arrival times distribution but leaves the
overall inter-arrival times distribution unchanged. Therefore, the
power-law behavior observed for $W(t)$ depends only on the skewed
frequency distribution and cannot be used to study the dynamical
properties of tag arrival.

Figure 7: The distribution of inter-arrival times between subsequent
tag assignments involving the same resource.
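The power-law result for the reshuffled case can be checked numerically. The sketch below evaluates the integral over tag frequencies for an exponential single-tag distribution, assuming $P(f) = f^{-\beta}$ up to a constant and an arrival rate $f/T$; the value of $T$, the integration range and the function name are illustrative choices. The integral is computed with the trapezoidal rule in the variable $u = \log f$.

```python
import math

def reshuffled_overall_distribution(t, beta, T, n=20000):
    """Numerically integrate P(f) * f * lambda(f) * exp(-lambda(f) * t) over f,
    with P(f) = f**(-beta) and lambda(f) = f / T (both assumed forms)."""
    u_min, u_max = -30.0, 30.0
    du = (u_max - u_min) / n
    total = 0.0
    for i in range(n + 1):
        f = math.exp(u_min + i * du)
        p_f = f ** (-beta)   # tag frequency distribution, up to a constant
        rate = f / T         # average arrival rate of a tag with frequency f
        integrand = p_f * f * rate * math.exp(-rate * t)
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * integrand * f * du  # df = f du
    return total

beta, T = 1.7, 1000.0
t1, t2 = 10.0, 100.0
w1 = reshuffled_overall_distribution(t1, beta, T)
w2 = reshuffled_overall_distribution(t2, beta, T)

# Slope of log W(t) against log t; it should be close to beta - 3.
slope = math.log(w2 / w1) / math.log(t2 / t1)
print(slope, beta - 3)
```

The same integral has the closed form $\Gamma(3 - \beta)\,(T/t)^{3 - \beta}/T$, which gives an independent check of the numerical value.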

5. COLLABORATIVE PATTERNS 

The bursty behavior of tagging activities is not in itself a signature
that complexity arises from the interaction of users. A clearer sign
of user cooperation can be found by analyzing the temporal patterns of
individual resources. Inter-arrival times $t$ between subsequent
taggings of the same resource are distributed according to a power law
with a sharp cut-off at large values of $t$; the cut-off goes to
infinity for less tagged resources, as displayed in figure 7. Since a
user cannot tag a resource twice, the fact that individual resources
are tagged in "avalanches" depends on the contribution of many users.
By contrast, if users were tagging independently of each other, $t$
would be distributed as an exponential random variable, as happens for
Poissonian processes. The individual resource inter-arrival
distribution can be analyzed as done above for tags, showing that
resources are tagged in bursts spanning all time scales.

A similar avalanche-like pattern can be observed by taking into
account only the first usage of a tag by each user, and observing the
inter-arrival times distribution $W_{f_1}(t_1)$ of these special
tagging events, where $t_1$ refers to their time separation and $f_1$
is the number of such events. This way, one rules out the possibility
that short inter-arrival times originate from users who often use a
given tag for their own interests, while long inter-arrival times come
from the numerous users who seldom tag resources with that particular
tag. If this were the case, the skewed distribution would just be the
result of the superposition of heterogeneous, yet independent, usage
patterns. Interestingly, the distribution of inter-arrival times of a
given tag, when one limits the observation to the first usage of that
tag by each user, follows the same statistics observed above when one
takes into account the whole tagging activity. In particular, the
relation reported in eq. (6) also holds for the inter-arrival times
$t_1$, as shown by the collapse reported in figure 8. If relation (6)
holds, tags are "discovered" by users in a correlated and bursty
manner.

Figure 8: The distribution of inter-arrival times between the first
usage of some tags by each user, divided by the number of such events
$f_1$, plotted as a function of $t_1 f_1$. Tags shown in the legend:
blog, google, osx, debug, cindy, highlighting, dealership, yunnan.
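The restriction to first usages can be sketched as a simple filter. The example assumes a chronological list of hypothetical `(time, user, resource, tag)` tuples, where the explicit time field is an addition for illustration; the function name `first_use_times` and the data are not from the datasets.

```python
def first_use_times(assignments, tag):
    """Times at which each user employs `tag` for the first time.

    assignments: iterable of (time, user, resource, tag) tuples
                 in chronological order.
    """
    seen_users = set()
    times = []
    for time, user, _resource, assigned_tag in assignments:
        if assigned_tag == tag and user not in seen_users:
            seen_users.add(user)
            times.append(time)
    return times

# Hypothetical chronological stream.
assignments = [
    (0, "u1", "r1", "blog"),
    (1, "u1", "r2", "blog"),   # repeated use by u1: ignored
    (2, "u2", "r1", "blog"),
    (3, "u3", "r3", "osx"),
    (7, "u3", "r1", "blog"),
    (8, "u2", "r4", "blog"),   # repeated use by u2: ignored
]
times = first_use_times(assignments, "blog")
t1_values = [b - a for a, b in zip(times, times[1:])]  # inter-arrival times t_1
f1 = len(times)                                        # number of first usages
print(times, t1_values, f1)
```

Since each user contributes at most one event, bursts in the resulting sequence cannot be produced by a single user repeating a favourite tag.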

However, this is not yet a proof of cooperation among online users. In
fact, bursts of attention may arise either from a direct mutual
influence among users or from users being independently influenced by
the same sources of information and news, in which case attention
bursts may originate without any interaction among them.

The stream of tag assignments involving a given resource, though,
carries clearer evidence of user interaction. By plotting the number
of distinct tags, i.e. the vocabulary, used for a resource as a
function of the number of tag assignments to it, one observes a
sub-linear vocabulary growth: the pace at which new tags are
introduced by users to describe a resource decreases with time, and
new tags are introduced less and less often. In other words, users
tend to employ the same tags used by previous peers when describing
the same resources.
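The vocabulary growth curve of a single resource is straightforward to compute. The following sketch uses a made-up tag sequence and a hypothetical function name; the exponent printed at the end is a crude single-point estimate, whereas a real analysis would fit the whole curve over many resources.

```python
import math

def vocabulary_growth(tag_sequence):
    """Number of distinct tags after each tag assignment to one resource."""
    seen = set()
    growth = []
    for tag in tag_sequence:
        seen.add(tag)
        growth.append(len(seen))
    return growth

# Hypothetical chronological tag assignments for one resource.
tags_for_resource = [
    "web", "design", "web", "css", "web",
    "design", "css", "web", "tools", "web",
]
growth = vocabulary_growth(tags_for_resource)

# Crude estimate of the growth exponent from the last point only:
# n ~ f**exponent, so exponent = log(n) / log(f).
exponent = math.log(growth[-1]) / math.log(len(growth))
print(growth, exponent)
```

An exponent below one indicates that most assignments reuse tags already present in the resource's vocabulary.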

Figure 9 shows that the sub-linear relation between tag assignments
and number of distinct tags involving a single resource holds for the
large majority of resources. Interestingly, this relation is not
respected by "spam" bookmarks, that is, by tag assignments violating
the collective agreement about the semantic organization of tags. Like
other signatures of complex features, this relation may prove useful
in methods of spam detection.



A deeper insight into the collective development of the tag vocabulary
associated with a resource is provided by studying the Inverse
Participation Ratio (IPR) of tags in such a vocabulary. Let $v_i$,
$i = 1, \dots, N$, be the components of a vector $v$ such that
$\sum_{i=1}^{N} v_i^2 = 1$. The definition of IPR is
$\mathrm{IPR} = \sum_{i=1}^{N} v_i^4$. If all components are equal to
$v_i = 1/\sqrt{N}$, the IPR is equal to $1/N$. Conversely, if all
components are null but one, $\mathrm{IPR} = 1$. So, the inverse of
the IPR estimates the number of vector components that contribute
significantly to the vector norm. Analogously, the IPR of tag streams,
computed on the relative frequency of tags, measures through its
inverse the number of significant tags used to describe a given
resource.

Figure 9: The number of distinct tags $n(f)$ assigned to a resource as
a function of the number $f$ of tag assignments involving the
resource. The majority of resources lie approximately on the line
$n(f) \propto f^{0.7}$. The plot also shows two clusters of resources
(in the ellipses) lying away from the line. A direct inspection
reveals that these resources are malicious "spam" bookmarks.
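The IPR of a tag stream can be computed in a few lines. This is a sketch under an explicit assumption: the vector components are taken as $v_i = \sqrt{p_i}$, with $p_i$ the relative frequency of tag $i$, so that the normalization $\sum_i v_i^2 = 1$ holds and the IPR reduces to $\sum_i p_i^2$. The text does not spell out this mapping, and the function name and example streams are hypothetical.

```python
from collections import Counter

def inverse_participation_ratio(tags):
    """IPR of a resource's tag stream.

    Assumption: v_i = sqrt(p_i), where p_i is the relative frequency of
    tag i, so that sum(v_i**2) == 1 and IPR = sum(v_i**4) = sum(p_i**2).
    """
    counts = Counter(tags)
    total = sum(counts.values())
    return sum((c / total) ** 2 for c in counts.values())

uniform = ["a", "b", "c", "d"]             # four equally used tags
dominated = ["a"] * 97 + ["b", "c", "d"]   # one tag dominates

ipr_uniform = inverse_participation_ratio(uniform)
ipr_dominated = inverse_participation_ratio(dominated)
print(ipr_uniform, 1 / ipr_uniform)
print(ipr_dominated, 1 / ipr_dominated)
```

Both streams contain four distinct tags, but the inverse IPR is four for the uniform one and close to one for the dominated one, reflecting the number of tags that actually matter.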

As shown in figure 10, the number of significant tags is rather
constant even for resources tagged thousands of times, showing that a
consensus is reached among users about how to describe a given
resource: subsequent users do not add new significant tags but rather
employ those already in use.

6. CONCLUSIONS 

We have studied empirically and analytically the behavior of users in
some well-known online collaborative tagging systems, where a large
number of users collect resources and classify them by attaching a
number of labels, called tags, to each of them. A resource can be
tagged by many users, and thus carry a large number of labels.

We have analyzed the statistics of inter-arrival times of tags, i.e.
the time interval between two subsequent occurrences of the same tag,
and of resources. We have uncovered non-trivial statistical
properties, which can be related to avalanches in the tagging
activities. Such bursty behavior shows that the tagging activity of
different users is strongly correlated. Regularities in the
inter-arrival times distribution have been studied analytically, so
that the dynamics of rare and frequent tags can be unified by a single
law, which depends only on the frequency parameter $f$. Moreover, we
have shown that users of tagging systems reach a consensus about the
tag description of each resource. In fact, we have empirically shown
that the number of significant tags for each resource is rather
constant, even for resources that have been tagged by thousands of
heterogeneous users. A by-product of our analysis concerns the
detection of spam in such freely accessible communities. The number of
distinct tags attached to a resource, i.e. the resource vocabulary
length, grows sub-linearly with the number of tagging events involving
that particular resource, with a relation which holds with good
precision for a large majority of resources. Two well-defined subsets
of resources, however, do not satisfy this relationship between the
resource occurrence and the resource vocabulary length. A direct
inspection of such resources reveals that they have been added during
malicious spam activity. This suggests a fast method to detect spam in
collaborative tagging systems.

Figure 10: The IPR of tag streams associated with individual resources
as a function of the number of tag assignments $f$ involving them.